Fp16 nchw for cudnn-fp16 backend (support GTX 16xx GPUs) #849

Merged
ankan-ban merged 20 commits into LeelaChessZero:master from fp16-nchw on May 13, 2019

Conversation

ankan-ban (Member) commented May 11, 2019

For the cudnn-fp16 backend: supports GPUs that have fast fp16 math but no tensor cores (e.g. GP100 and the GTX 16xx series).
TODO:

1. Figure out a way to check for tensor cores (and select NHWC vs NCHW based on that).
   • Done. Unfortunately this involves string-matching the device name; hopefully a later CUDA release will fix this. (A sketch of the detection logic follows this list.)
2. Actually run on a GTX 16xx card and check performance.
   • Done (see below). Slightly less than a 2x speedup over fp32.
3. Misc cleanup and clang-format.
   • Done.
4. (Optional) Maybe write a fused kernel for the SE layer.
   • Maybe later, or maybe not needed: with the current implementation, raw SE nps is only ~6% lower than with a non-SE net (tested with 256x20 networks).
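As a rough illustration of item 1, here is a minimal sketch (not the actual lc0 code) of what the detection can look like with the CUDA runtime API; the "GTX 16" substring is an assumption for illustration:

```cpp
#include <cuda_runtime.h>
#include <cstring>

// Sketch only: report whether a CUDA device has tensor cores.
bool HasTensorCores(int device) {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;
  // Tensor cores first appeared with Volta (SM 7.0).
  if (prop.major < 7) return false;
  // GTX 16xx (TU11x) parts report SM 7.5 just like RTX 20xx but have no
  // tensor cores, so the marketing name has to be checked as well.
  // (The exact substring is an assumption, not lc0's actual check.)
  if (std::strstr(prop.name, "GTX 16") != nullptr) return false;
  return true;
}
```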

use bestmove_is_sent_ for Search::IsSearchActive() (LeelaChessZero#502)
- replace all cudaMemcpyAsync used for loading weights with cudaMemcpy, as the source (in CPU memory) could be freed before the async version of the function actually performs the copy.
- minor naming/style changes.
- add a comment explaining what the policy map layer does and how the layout conversion from CHW to HWC works.
- try NCHW layout and the Winograd algorithm for convolutions (same as what we use for fp32); see the sketches after this list.
- it's slower than NHWC/fp16 on GPUs with tensor cores, but should give some speedup on GP100 and TU11x GPUs.
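Two of the points above lend themselves to small sketches. First, the CHW-to-HWC conversion mentioned for the policy map layer is plain index arithmetic; assuming a C x H x W tensor:

```cpp
// Offset of element (c, h, w) in each layout.
inline int ChwIndex(int c, int h, int w, int C, int H, int W) {
  return c * H * W + h * W + w;  // channels outermost
}
inline int HwcIndex(int c, int h, int w, int C, int H, int W) {
  return h * W * C + w * C + c;  // channels innermost
}
```

Second, a minimal sketch, assuming cuDNN 7+, of what configuring an fp16 convolution in NCHW layout involves (the function name and parameters are illustrative, not lc0's actual code):

```cpp
#include <cudnn.h>

// Sketch only: fp16 data and filters in NCHW layout, tensor-op math off.
void SetupFp16NchwConv(cudnnTensorDescriptor_t inDesc,
                       cudnnFilterDescriptor_t filterDesc,
                       cudnnConvolutionDescriptor_t convDesc,
                       int n, int c, int h, int w, int k, int filterSize) {
  cudnnSetTensor4dDescriptor(inDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF,
                             n, c, h, w);
  cudnnSetFilter4dDescriptor(filterDesc, CUDNN_DATA_HALF, CUDNN_TENSOR_NCHW,
                             k, c, filterSize, filterSize);
  const int pad = filterSize / 2;
  cudnnSetConvolution2dDescriptor(convDesc, pad, pad, /*stride=*/1, 1,
                                  /*dilation=*/1, 1, CUDNN_CROSS_CORRELATION,
                                  CUDNN_DATA_HALF);
  // Without tensor cores, keep the default math mode; a Winograd forward
  // algorithm (e.g. CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD_NONFUSED) can then
  // be selected, as the fp32 path does.
  cudnnSetConvolutionMathType(convDesc, CUDNN_DEFAULT_MATH);
}
```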
ankan-ban added the wip (Work in progress) label on May 11, 2019
ankan-ban (Member, Author) commented May 11, 2019

Some benchmarks on a GTX 1650:

1. fp32:
Benchmark final time 9.77541s calculating 2303.53 nodes per second.

2. fp16 with NHWC (current default):
Benchmark final time 9.35774s calculating 224.627 nodes per second.

3. fp16 with NCHW layout, with the TENSOR_OP_MATH setting enabled:
Benchmark final time 8.7635s calculating 535.517 nodes per second.

4. fp16 with NCHW layout, without the TENSOR_OP_MATH setting enabled:
Benchmark final time 8.67777s calculating 4238.88 nodes per second.

It's surprising that the fp16/NHWC path works at all on the GTX 1650. Maybe cuDNN/cuBLAS is just emulating it, which would explain why it's so slow. Even with the NCHW path, enabling the TENSOR_OP_MATH flag keeps it very slow (again, likely because tensor cores have to be emulated somehow).

The good news is that the fp16/NCHW layout without TENSOR_OP_MATH is almost 2x faster than fp32. A sketch of the flag in question follows.
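For reference, the single switch separating cases 3 and 4 above looks roughly like this; a sketch assuming cuDNN 7+ and cuBLAS 9+ handles that already exist, not the actual lc0 code:

```cpp
#include <cublas_v2.h>
#include <cudnn.h>

// Toggle tensor-op math for convolutions and GEMMs.
void SetTensorOpMath(cudnnConvolutionDescriptor_t convDesc,
                     cublasHandle_t cublas, bool enable) {
  // enable == true  -> case 3 above (slow without real tensor cores)
  // enable == false -> case 4 above (the fast path on GTX 16xx / GP100)
  cudnnSetConvolutionMathType(
      convDesc, enable ? CUDNN_TENSOR_OP_MATH : CUDNN_DEFAULT_MATH);
  cublasSetMathMode(
      cublas, enable ? CUBLAS_TENSOR_OP_MATH : CUBLAS_DEFAULT_MATH);
}
```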

ankan-ban changed the title from "Fp16 nchw" to "Fp16 nchw for cudnn-fp16 backend (support GTX 16xx cards)" on May 12, 2019
ankan-ban changed the title from "Fp16 nchw for cudnn-fp16 backend (support GTX 16xx cards)" to "Fp16 nchw for cudnn-fp16 backend (support GTX 16xx GPUs)" on May 12, 2019
- GP100 (SM 6.0)
- GTX 16xx GPUs (unfortunately they report the same SM 7.5 version as tensor-core Turing parts, so a string compare on the device name is needed)
ankan-ban removed the wip (Work in progress) label on May 12, 2019
ankan-ban requested a review from borg323 on May 12, 2019 07:51
default is auto-select (-1).
Review threads (resolved): src/neural/cuda/common_kernels.cu (outdated), src/neural/cuda/network_cudnn.cc
Use a bool option instead of an int, and use the IsDefault mechanism to check whether the option was forced or not.
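A hypothetical illustration of that commit's logic (the BoolOptions type below is a stand-in, not lc0's actual options API): honor a user-forced value, otherwise auto-select from tensor-core availability.

```cpp
#include <map>
#include <string>

// Stand-in options store with an IsDefault check (hypothetical).
struct BoolOptions {
  std::map<std::string, bool> values;    // options explicitly set by the user
  std::map<std::string, bool> defaults;  // built-in defaults

  bool IsDefault(const std::string& name) const {
    return values.find(name) == values.end();
  }
  bool Get(const std::string& name) const {
    auto it = values.find(name);
    return it != values.end() ? it->second : defaults.at(name);
  }
};

bool ChooseNhwcLayout(const BoolOptions& options, bool deviceHasTensorCores) {
  // If the user explicitly set the (hypothetical) "nhwc" option, honor it.
  if (!options.IsDefault("nhwc")) return options.Get("nhwc");
  // Otherwise auto-select: NHWC only pays off with tensor cores.
  return deviceHasTensorCores;
}
```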
ankan-ban merged commit fa926e5 into LeelaChessZero:master on May 13, 2019
ankan-ban deleted the fp16-nchw branch May 13, 2019 03:46
ankan-ban restored the fp16-nchw branch May 16, 2019 16:58
@@ -0,0 +1,2 @@
 layers.cc
lc0@exe.vcxproj -> C:\Ankan\git\ankan\lc0\build\.\lc0.exe
Member
What is this file? :)

ankan-ban (Member, Author)
Sorry. Likely some intermediate build file that accidentally got committed. Will remove it.

rajb245 (Contributor) commented Sep 3, 2019

Does the merged work apply only to the cards using the GP100, i.e., the Quadro GP100 and the Tesla P100? Can similar techniques apply to the other Pascal chips, in particular GP102 chips like the Titan X (Pascal) and Titan Xp? NVIDIA advertises some level of fp16 acceleration on those, but I don't know enough of the implementation to know the differences.

If there's a path to accelerate performance on GP102 using similar techniques, please let me know and I'll open a feature request issue.

ankan-ban (Member, Author)

Unfortunately no. Other Pascal chips (GP102/GP104/GP106, etc.) don't have fast fp16 math (it runs at a tiny fraction of the fp32 rate). They do support higher-throughput int8 math, but right now lc0 has no support for int8 precision.

jjoshua2 (Contributor) commented Sep 8, 2019 via email

borg323 (Member) commented Sep 8, 2019

Do you know whether using fp16 with CC 6.2 (Jetson TX2) also gives a performance gain?
